PROJECT: Analyze the European Soccer Database¶

Table of Contents¶

  • Introduction
  • Data Wrangling
  • Exploratory Data Analysis
  • What teams improved the most over the time period?
  • Which players had the most penalties?
  • What team attributes lead to the most victories?
    • Logistic Regression Model
  • Conclusions
  • References

Introduction¶

This dataset contains:

  • information on 11 soccer leagues from different European countries
  • data on more than 25,000 matches played
  • data on more than 11,000 players
  • seasons from 2008 to 2016
  • player and team attributes

The main questions to be answered for this project are:

  1. What teams improved the most over the time period?
  2. Which players had the most penalties?
  3. What team attributes lead to the most victories?
In [1]:
import pandas as pd
import numpy as np
import sqlite3

from datetime import datetime
from collections import Counter

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler

import plotly.graph_objects as go
import plotly.express as px
In [2]:
con = sqlite3.connect("./database.sqlite")
In [3]:
def read_table(table_name, con):
    # Table names cannot be bound as SQL parameters, so the f-string below is
    # safe only for trusted, hard-coded table names.
    return pd.read_sql_query(f"SELECT * FROM {table_name}", con)

Data Wrangling¶

In [4]:
"""
This function removes from matches dataframe those columns that have null values. It also dismiss the columns
'id' and 'match_api_id' because do not contribute with valuable information.
"""

def remove_unused_columns_matches(matches):
    # print("Initial shape of matches dataframe: ", matches.shape)
    count_nulls = matches.isnull().sum()
    columns_to_remove = count_nulls[count_nulls > 0]
    matches.drop(columns=columns_to_remove.index, inplace=True)
    matches.drop(columns=["id", "match_api_id"], inplace=True)

    # Check the shape is the correct one
    assert matches.shape == (25979, 9)
In [5]:
"""
Function that casts the column 'date' from object type to datetime type.
"""

def convert_date_format(matches):
    matches["date"] = matches["date"].astype("datetime64[ns]")
    
    assert matches.dtypes["date"] == np.dtype("<M8[ns]")
In [6]:
"""
Function that adds the column 'country' to matches dataframe.
"""

def add_country_to_matches(matches, countries):
    # Transform column 'country_id' to 'country' using the name of the countries
    countries_dict = dict(zip(countries["id"], countries["name"]))
    matches["country"] = matches["country_id"].map(countries_dict)
    matches.drop(columns=["country_id"], inplace=True)

    # Check the final dataframe has a column 'country' and does not have a column 'country_id'
    assert matches.shape == (25979, 9)
    assert "country" in matches.columns
    assert "country_id" not in matches.columns
In [7]:
"""
Function that replaces the league IDs with league's names.
"""

def add_league_column_to_matches(matches, leagues):
    leagues_dict = dict(zip(leagues["id"], leagues["name"]))
    matches["league"] = matches["league_id"].map(leagues_dict)
    matches.drop(columns=["league_id"], inplace=True)

    # Check the final dataframe has a column 'league' and does not have a column 'league_id'
    assert matches.shape == (25979, 9)
    assert "league" in matches.columns
    assert "league_id" not in matches.columns
In [8]:
"""
Replaces the teams' IDs with the teams' names.
"""

def add_team_names_to_matches(matches, teams):
    teams_dict = dict(zip(teams["team_api_id"], teams["team_long_name"]))
    matches["home_team"] = matches["home_team_api_id"].map(teams_dict)
    matches["away_team"] = matches["away_team_api_id"].map(teams_dict)
    matches.drop(columns=["home_team_api_id", "away_team_api_id"], inplace=True)
    
    # Check the final dataframe has two new columns 'home_team' and 'away_team'
    assert matches.shape == (25979, 9)
    assert {"home_team", "away_team"}.issubset(matches.columns)
    assert not {"home_team_api_id", "away_team_api_id"}.issubset(matches.columns)

The following function applies the full wrangling and cleaning pipeline to the matches dataframe, preparing it for the analysis.

In [9]:
def data_wrangling_matches(matches, countries, leagues, teams):
    remove_unused_columns_matches(matches)
    convert_date_format(matches)
    add_country_to_matches(matches, countries)
    add_league_column_to_matches(matches, leagues)
    add_team_names_to_matches(matches, teams)
    

These are the queries that read the tables as they are stored in the SQLite database.

In [10]:
countries = read_table(table_name="Country", con=con)
leagues = read_table(table_name="League", con=con)
matches = read_table(table_name="Match", con=con)
teams = read_table(table_name="Team", con=con)

First, the data_wrangling_matches function defined two cells above is applied to the matches dataframe; here is the result of that process.

In [11]:
data_wrangling_matches(matches, countries, leagues, teams)
matches
Out[11]:
season stage date home_team_goal away_team_goal country league home_team away_team
0 2008/2009 1 2008-08-17 1 1 Belgium Belgium Jupiler League KRC Genk Beerschot AC
1 2008/2009 1 2008-08-16 0 0 Belgium Belgium Jupiler League SV Zulte-Waregem Sporting Lokeren
2 2008/2009 1 2008-08-16 0 3 Belgium Belgium Jupiler League KSV Cercle Brugge RSC Anderlecht
3 2008/2009 1 2008-08-17 5 0 Belgium Belgium Jupiler League KAA Gent RAEC Mons
4 2008/2009 1 2008-08-16 1 3 Belgium Belgium Jupiler League FCV Dender EH Standard de Liège
... ... ... ... ... ... ... ... ... ...
25974 2015/2016 9 2015-09-22 1 0 Switzerland Switzerland Super League FC St. Gallen FC Thun
25975 2015/2016 9 2015-09-23 1 2 Switzerland Switzerland Super League FC Vaduz FC Luzern
25976 2015/2016 9 2015-09-23 2 0 Switzerland Switzerland Super League Grasshopper Club Zürich FC Sion
25977 2015/2016 9 2015-09-22 0 0 Switzerland Switzerland Super League Lugano FC Zürich
25978 2015/2016 9 2015-09-23 4 3 Switzerland Switzerland Super League BSC Young Boys FC Basel

25979 rows × 9 columns

To verify that the dataframe contains no null values, the info() function is applied.

In [12]:
matches.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25979 entries, 0 to 25978
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   season          25979 non-null  object        
 1   stage           25979 non-null  int64         
 2   date            25979 non-null  datetime64[ns]
 3   home_team_goal  25979 non-null  int64         
 4   away_team_goal  25979 non-null  int64         
 5   country         25979 non-null  object        
 6   league          25979 non-null  object        
 7   home_team       25979 non-null  object        
 8   away_team       25979 non-null  object        
dtypes: datetime64[ns](1), int64(3), object(5)
memory usage: 1.8+ MB

Once everything is cleaned and ordered, it is time to start answering the first question: 'What teams improved the most over the time period?'

Before we answer this question we can plot the total goals scored per season per team.

In [13]:
goals_for = pd.concat(
    [
        matches[["season", "home_team", "home_team_goal"]].rename(columns={"home_team": "team", "home_team_goal": "for"}),
        matches[["season", "away_team", "away_team_goal"]].rename(columns={"away_team": "team", "away_team_goal": "for"}),
    ]
)
goals_for = goals_for.groupby(["season", "team"]).sum()

goals_against = pd.concat(
    [
        matches[["season", "home_team", "away_team_goal"]].rename(columns={"home_team": "team", "away_team_goal": "against"}),
        matches[["season", "away_team", "home_team_goal"]].rename(columns={"away_team": "team", "home_team_goal": "against"}),
    ]
)
goals_against = goals_against.groupby(["season", "team"]).sum()

goals = pd.merge(
    goals_for.reset_index(),
    goals_against.reset_index(),
    on=["season", "team"],
)
goals["gd"] = goals["for"] - goals["against"]
goals.sort_values(by=["season"], inplace=True, ignore_index=True)
goals
Out[13]:
season team for against gd
0 2008/2009 1. FC Köln 35 50 -15
1 2008/2009 P. Warszawa 40 23 17
2 2008/2009 PSV 71 33 38
3 2008/2009 Palermo 57 50 7
4 2008/2009 Paris Saint-Germain 49 38 11
... ... ... ... ... ...
1473 2015/2016 Frosinone 35 76 -41
1474 2015/2016 GFC Ajaccio 37 58 -21
1475 2015/2016 Genoa 45 48 -3
1476 2015/2016 FC Thun 45 54 -9
1477 2015/2016 Śląsk Wrocław 28 37 -9

1478 rows × 5 columns

Exploratory Data Analysis¶

The next plot shows the total goals scored across the eight seasons.

In [14]:
fig = go.Figure()

# Group by the column name (not a list) so `team` is a plain string,
# and sort a copy to avoid mutating the groupby chunk in place.
for team, group in goals.groupby("team"):
    if group.shape[0] == 8:
        group = group.sort_values(by=["season"])
        fig.add_trace(go.Scatter(x=group["season"], y=group["for"], name=team, connectgaps=False))
fig.update_layout(
    title="Total goals per team over 8 seasons",
    xaxis_title="Seasons",
    yaxis_title="Goals",
    legend_title_text="Teams",
)
fig.show()

From this plot it is clear that teams like FC Barcelona and Real Madrid CF lead the goal scoring among all clubs in this European Soccer database. However, that does not imply they have improved over time, so it is necessary to find another way to represent improvement.

One proposal is to calculate the points each team earns after each match. To do that, two extra columns are added to the matches dataframe: home_team_points and away_team_points. A win earns a team three points, a loss earns zero, and a draw earns one point for each team.
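Since np.select with paired choice lists is not entirely obvious, here is a minimal demonstration on a hypothetical four-match fixture of how each condition row selects a [home_points, away_points] pair:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-fixture: home and away goals for four matches
demo = pd.DataFrame({"home_goal": [2, 0, 1, 3], "away_goal": [1, 2, 1, 0]})

condlist = [
    (demo["home_goal"] > demo["away_goal"]).to_numpy().reshape(-1, 1),  # home win
    (demo["home_goal"] < demo["away_goal"]).to_numpy().reshape(-1, 1),  # away win
]
# Each (n, 1) boolean column broadcasts against a [home, away] pair,
# and the default [1, 1] covers draws.
points = np.select(condlist, [[3, 0], [0, 3]], default=[1, 1])
print(points)  # [[3 0] [0 3] [1 1] [3 0]]
```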

In [15]:
condlist = [
    np.array(matches["home_team_goal"] > matches["away_team_goal"]).reshape(-1, 1),
    np.array(matches["home_team_goal"] < matches["away_team_goal"]).reshape(-1, 1),
]

choicelist = [[3,0], [0, 3]]
points = pd.DataFrame(
    np.select(condlist=condlist, choicelist=choicelist, default=[1, 1]),
    columns=["home_team_points", "away_team_points"],
)
matches = pd.concat(
    [
        matches,
        points,
    ],
    axis=1,
)
matches
Out[15]:
season stage date home_team_goal away_team_goal country league home_team away_team home_team_points away_team_points
0 2008/2009 1 2008-08-17 1 1 Belgium Belgium Jupiler League KRC Genk Beerschot AC 1 1
1 2008/2009 1 2008-08-16 0 0 Belgium Belgium Jupiler League SV Zulte-Waregem Sporting Lokeren 1 1
2 2008/2009 1 2008-08-16 0 3 Belgium Belgium Jupiler League KSV Cercle Brugge RSC Anderlecht 0 3
3 2008/2009 1 2008-08-17 5 0 Belgium Belgium Jupiler League KAA Gent RAEC Mons 3 0
4 2008/2009 1 2008-08-16 1 3 Belgium Belgium Jupiler League FCV Dender EH Standard de Liège 0 3
... ... ... ... ... ... ... ... ... ... ... ...
25974 2015/2016 9 2015-09-22 1 0 Switzerland Switzerland Super League FC St. Gallen FC Thun 3 0
25975 2015/2016 9 2015-09-23 1 2 Switzerland Switzerland Super League FC Vaduz FC Luzern 0 3
25976 2015/2016 9 2015-09-23 2 0 Switzerland Switzerland Super League Grasshopper Club Zürich FC Sion 3 0
25977 2015/2016 9 2015-09-22 0 0 Switzerland Switzerland Super League Lugano FC Zürich 1 1
25978 2015/2016 9 2015-09-23 4 3 Switzerland Switzerland Super League BSC Young Boys FC Basel 3 0

25979 rows × 11 columns

Once the points per match are calculated, the data has to be reshaped into one row per team per match, so that each team's points can be analyzed on their own.

In [16]:
points = pd.concat(
    [
        matches[["country", "season", "date", "home_team", "home_team_points"]].rename(columns={"home_team": "team", "home_team_points": "points"}),
        matches[["country", "season", "date", "away_team", "away_team_points"]].rename(columns={"away_team": "team", "away_team_points": "points"}),
    ],
    ignore_index=True,
)
points.sort_values(by=["team", "date"], ignore_index=True, inplace=True)
points
Out[16]:
country season date team points
0 Germany 2010/2011 2010-08-21 1. FC Kaiserslautern 3
1 Germany 2010/2011 2010-08-27 1. FC Kaiserslautern 3
2 Germany 2010/2011 2010-09-12 1. FC Kaiserslautern 0
3 Germany 2010/2011 2010-09-18 1. FC Kaiserslautern 1
4 Germany 2010/2011 2010-09-22 1. FC Kaiserslautern 0
... ... ... ... ... ...
51953 Poland 2015/2016 2016-03-08 Śląsk Wrocław 0
51954 Poland 2015/2016 2016-03-11 Śląsk Wrocław 1
51955 Poland 2015/2016 2016-03-19 Śląsk Wrocław 1
51956 Poland 2015/2016 2016-04-01 Śląsk Wrocław 3
51957 Poland 2015/2016 2016-04-09 Śląsk Wrocław 3

51958 rows × 5 columns

Let's plot the points over time for the teams that scored the most goals: FC Barcelona and Real Madrid CF.

In [17]:
# Take explicit copies so the in-place sorts do not trigger SettingWithCopyWarning
barcelona = points.loc[points["team"] == "FC Barcelona", ["date", "points"]].copy()
barcelona.sort_values(by=["date"], inplace=True, ignore_index=True)
real_madrid = points.loc[points["team"] == "Real Madrid CF", ["date", "points"]].copy()
real_madrid.sort_values(by=["date"], inplace=True, ignore_index=True)

fig = go.Figure()
fig.add_trace(go.Scatter(x=barcelona.index, y=barcelona["points"].cumsum(), name="FC Barcelona"))
fig.add_trace(go.Scatter(x=real_madrid.index, y=real_madrid["points"].cumsum(), name="Real Madrid CF"))
fig.update_layout(
    title="Points accumulated over all matches",
    xaxis_title="matches",
    yaxis_title="points",
)
fig.show()

This representation does not tell us clearly whether the teams are improving or not, because at least these two teams sit at the top of the rankings the whole time. So another strategy is to measure how a team improves within a season, and from that number calculate the general improvement across all seasons.

What teams improved the most over the time period?¶

How can improvement over time be represented? One option is to calculate the linear trend of the points a team earns per match. The trend is a linear regression fitted to the average points over blocks of three consecutive matches, and its coefficient is the slope: a positive slope means the team's performance is improving, while a negative one means it is getting worse over time.
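The idea can be sketched on a toy season (the match results below are made up for illustration): average the per-match points in blocks of three, then fit a regression and read the sign of the slope.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy per-match points for one hypothetical 12-match season:
# mostly draws and losses early, wins later, so we expect a positive slope.
match_points = np.array([0, 1, 1, 1, 1, 3, 3, 1, 3, 3, 3, 3])

# Average over blocks of three consecutive matches
block_means = match_points.reshape(-1, 3).mean(axis=1)

X = np.arange(len(block_means)).reshape(-1, 1)
reg = LinearRegression().fit(X, block_means)
print(reg.coef_[0])  # positive slope -> improving form
```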

In [19]:
groups = points.groupby(["season", "team"])
slopes = points[["country", "team"]].drop_duplicates().set_index("team")

for (season, team), team_points in groups:
    # Average the points over blocks of three consecutive matches; select the
    # "points" column so the mean is not attempted over non-numeric columns
    avg_points = team_points.groupby(np.arange(len(team_points)) // 3)["points"].mean()
    X = np.arange(len(avg_points)).reshape(-1, 1)
    y = avg_points.to_numpy()
    regressor = LinearRegression()
    regressor.fit(X, y)
    slopes.loc[team, season] = regressor.coef_[0]

slopes.head(5)
Out[19]:
country 2008/2009 2009/2010 2010/2011 2011/2012 2012/2013 2013/2014 2014/2015 2015/2016
team
1. FC Kaiserslautern Germany NaN NaN 0.144522 -0.075758 NaN NaN NaN NaN
1. FC Köln Germany -0.050117 0.002331 0.156177 -0.114219 NaN NaN -0.006993 -0.038462
1. FC Nürnberg Germany NaN 0.096737 -0.045455 -0.004662 0.086247 -0.046620 NaN NaN
1. FSV Mainz 05 Germany NaN -0.036131 -0.023310 -0.019814 -0.072261 0.057110 -0.044289 -0.025641
AC Ajaccio France NaN NaN NaN 0.075092 -0.038462 0.029304 NaN NaN

Before further analysis, note that the dataset contains some null values. That's because some teams did not play all eight seasons. Let's remove them, since the analysis focuses on the teams with the most information available.

In [20]:
slopes.dropna(inplace=True)

Let's define a function that fits a linear regression model across all seasons and returns its slope.

In [21]:
def calculate_slope(X, y):
    regressor = LinearRegression()
    regressor.fit(X, y)
    return regressor.coef_[0]

Apply that function to all teams.

In [22]:
seasons = matches["season"].unique()
slopes["general_slope"] = slopes[seasons].apply(
    lambda x: calculate_slope(np.arange(len(x)).reshape(-1, 1), x),
    axis=1,
)
slopes.head(3)
Out[22]:
country 2008/2009 2009/2010 2010/2011 2011/2012 2012/2013 2013/2014 2014/2015 2015/2016 general_slope
team
ADO Den Haag Netherlands 0.041958 -0.030303 -0.065268 -0.076923 -0.074592 0.150350 -0.008159 0.029138 0.007978
AS Saint-Étienne France 0.042125 0.020147 -0.078755 -0.051282 0.018315 0.060440 0.040293 -0.032967 0.000741
AZ Netherlands -0.002331 0.053613 -0.033800 -0.064103 0.001166 -0.054779 0.081585 0.148019 0.014222

Now let's sort by slope and take the five teams with the largest coefficients: these are the teams with the best improvement over time.

In [23]:
slopes.sort_values(by=["general_slope"], ascending=False).head(5)
Out[23]:
country 2008/2009 2009/2010 2010/2011 2011/2012 2012/2013 2013/2014 2014/2015 2015/2016 general_slope
team
KAA Gent Belgium 0.010490 0.030303 -0.018182 -0.044444 -0.052525 1.333333e+00 0.094949 -0.012121 0.050136
Bayer 04 Leverkusen Germany -0.143357 -0.143357 0.058275 0.051282 0.068765 -5.244755e-02 -0.015152 0.111888 0.025155
Sunderland England -0.069597 -0.029304 -0.034799 -0.012821 -0.016484 9.340659e-02 0.012821 0.108059 0.021847
TSG 1899 Hoffenheim Germany -0.047786 -0.051282 -0.113054 -0.075758 0.110723 7.925408e-02 -0.006993 0.066434 0.021243
Lechia Gdańsk Poland -0.024242 -0.115152 -0.054545 0.018182 -0.096970 -1.222315e-17 0.026263 0.086869 0.018254

And the five teams with the lowest slope values, meaning their performance declined over time.

In [24]:
slopes.sort_values(by=["general_slope"]).head(5)
Out[24]:
country 2008/2009 2009/2010 2010/2011 2011/2012 2012/2013 2013/2014 2014/2015 2015/2016 general_slope
team
FC Bayern Munich Germany 0.066434 0.062937 0.120047 0.031469 0.024476 -0.027972 -0.047786 -0.023310 -0.019439
Ajax Netherlands 0.004662 0.089744 0.061772 0.135198 0.093240 -0.010490 -0.080420 -0.053613 -0.018065
Atlético Madrid Spain 0.086081 0.018315 0.040293 0.071429 -0.087912 -0.087912 -0.065934 0.012821 -0.017595
Everton England 0.067766 0.104396 0.080586 0.071429 -0.012821 -0.001832 0.034799 -0.038462 -0.016941
Roma Italy 0.064103 0.128205 0.020147 0.031136 0.029304 -0.102564 -0.084249 0.076923 -0.015982

Let's plot the improvement over all seasons for the teams that improved the most.

In [25]:
top_3 = slopes.sort_values(by=["general_slope"], ascending=False).head(3)
fig = go.Figure()
for team in top_3.index.to_list():
    fig.add_trace(go.Scatter(x=seasons, y=slopes[seasons].loc[team], name=team))

fig.update_layout(
    title="Teams with best improvement along all seasons",
    xaxis_title="Seasons",
    yaxis_title="Slopes",
    legend_title="Teams",
    font=dict(
        family="Courier New",
        size=14,
        color="RebeccaPurple",
    )
)
fig.show()

Now, plot the improvement for the teams that have the most negative slopes over all seasons.

In [26]:
last_3 = slopes.sort_values(by=["general_slope"]).head(3)
fig = go.Figure()
for team in last_3.index.to_list():
    fig.add_trace(go.Scatter(x=seasons, y=slopes[seasons].loc[team], name=team))

fig.update_layout(
    title="Teams with worst improvement along all seasons",
    xaxis_title="Seasons",
    yaxis_title="Slopes",
    legend_title="Teams",
    font=dict(
        family="Courier New",
        size=14,
        color="RebeccaPurple",
    )
)
fig.show()

Which players had the most penalties?¶

The next query retrieves information from two tables, Player and Player_Attributes, because the second one contains the penalties data.

In [27]:
players_attributes = pd.read_sql_query("""
    SELECT 
        p.player_name
        , p.birthday
        , p.height
        , p.weight
        , pa.*
    FROM Player as p
    LEFT JOIN Player_Attributes pa on p.player_api_id = pa.player_api_id;
    """,
    con=con,
)
players_attributes
Out[27]:
player_name birthday height weight id player_fifa_api_id player_api_id date overall_rating potential ... vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
0 Aaron Appindangoye 1992-02-29 00:00:00 182.88 187 5 218353 505942 2007-02-22 00:00:00 61.0 65.0 ... 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0
1 Aaron Appindangoye 1992-02-29 00:00:00 182.88 187 4 218353 505942 2015-03-20 00:00:00 61.0 65.0 ... 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0
2 Aaron Appindangoye 1992-02-29 00:00:00 182.88 187 3 218353 505942 2015-09-21 00:00:00 62.0 66.0 ... 54.0 48.0 65.0 66.0 69.0 6.0 11.0 10.0 8.0 8.0
3 Aaron Appindangoye 1992-02-29 00:00:00 182.88 187 2 218353 505942 2015-11-19 00:00:00 67.0 71.0 ... 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
4 Aaron Appindangoye 1992-02-29 00:00:00 182.88 187 1 218353 505942 2016-02-18 00:00:00 67.0 71.0 ... 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
183973 Zvjezdan Misimovic 1982-06-05 00:00:00 180.34 176 183973 102359 39902 2010-08-30 00:00:00 83.0 85.0 ... 88.0 73.0 22.0 31.0 30.0 9.0 13.0 8.0 15.0 5.0
183974 Zvjezdan Misimovic 1982-06-05 00:00:00 180.34 176 183972 102359 39902 2011-02-22 00:00:00 81.0 85.0 ... 87.0 73.0 22.0 31.0 30.0 9.0 13.0 8.0 15.0 5.0
183975 Zvjezdan Misimovic 1982-06-05 00:00:00 180.34 176 183971 102359 39902 2011-08-30 00:00:00 81.0 81.0 ... 87.0 78.0 22.0 31.0 30.0 9.0 13.0 8.0 15.0 5.0
183976 Zvjezdan Misimovic 1982-06-05 00:00:00 180.34 176 183970 102359 39902 2012-02-22 00:00:00 81.0 81.0 ... 87.0 78.0 22.0 31.0 30.0 9.0 13.0 8.0 15.0 5.0
183977 Zvjezdan Misimovic 1982-06-05 00:00:00 180.34 176 183969 102359 39902 2012-08-31 00:00:00 78.0 78.0 ... 82.0 78.0 22.0 31.0 30.0 9.0 13.0 8.0 15.0 5.0

183978 rows × 46 columns

Let's check how many null values exist for that column.

In [28]:
players_attributes["penalties"].isnull().sum()
Out[28]:
836

Let's remove the rows that contain null values.

In [29]:
players_attributes.dropna(subset=["penalties"], inplace=True)
players_attributes
Out[29]:
player_name birthday height weight id player_fifa_api_id player_api_id date overall_rating potential ... vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
0 Aaron Appindangoye 1992-02-29 00:00:00 182.88 187 5 218353 505942 2007-02-22 00:00:00 61.0 65.0 ... 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0
1 Aaron Appindangoye 1992-02-29 00:00:00 182.88 187 4 218353 505942 2015-03-20 00:00:00 61.0 65.0 ... 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0
2 Aaron Appindangoye 1992-02-29 00:00:00 182.88 187 3 218353 505942 2015-09-21 00:00:00 62.0 66.0 ... 54.0 48.0 65.0 66.0 69.0 6.0 11.0 10.0 8.0 8.0
3 Aaron Appindangoye 1992-02-29 00:00:00 182.88 187 2 218353 505942 2015-11-19 00:00:00 67.0 71.0 ... 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
4 Aaron Appindangoye 1992-02-29 00:00:00 182.88 187 1 218353 505942 2016-02-18 00:00:00 67.0 71.0 ... 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
183973 Zvjezdan Misimovic 1982-06-05 00:00:00 180.34 176 183973 102359 39902 2010-08-30 00:00:00 83.0 85.0 ... 88.0 73.0 22.0 31.0 30.0 9.0 13.0 8.0 15.0 5.0
183974 Zvjezdan Misimovic 1982-06-05 00:00:00 180.34 176 183972 102359 39902 2011-02-22 00:00:00 81.0 85.0 ... 87.0 73.0 22.0 31.0 30.0 9.0 13.0 8.0 15.0 5.0
183975 Zvjezdan Misimovic 1982-06-05 00:00:00 180.34 176 183971 102359 39902 2011-08-30 00:00:00 81.0 81.0 ... 87.0 78.0 22.0 31.0 30.0 9.0 13.0 8.0 15.0 5.0
183976 Zvjezdan Misimovic 1982-06-05 00:00:00 180.34 176 183970 102359 39902 2012-02-22 00:00:00 81.0 81.0 ... 87.0 78.0 22.0 31.0 30.0 9.0 13.0 8.0 15.0 5.0
183977 Zvjezdan Misimovic 1982-06-05 00:00:00 180.34 176 183969 102359 39902 2012-08-31 00:00:00 78.0 78.0 ... 82.0 78.0 22.0 31.0 30.0 9.0 13.0 8.0 15.0 5.0

183142 rows × 46 columns

Since there are multiple records for each player, the dataframe can be grouped by player to get the maximum penalties rating registered for each one. The results can then be sorted in descending order of penalties.

In [30]:
players_penalties = players_attributes.groupby("player_name")["penalties"].max()
players_penalties.sort_values(ascending=False)
Out[30]:
player_name
Rickie Lambert            96.0
Andrea Pirlo              95.0
Xavi Hernandez            95.0
Mario Balotelli           95.0
Paul Scholes              95.0
                          ... 
Igor Stefanovic           11.0
Giedrius Arlauskis        11.0
Jakub Szumski             11.0
Timothy van der Meulen    10.0
Jakub Divis                9.0
Name: penalties, Length: 10848, dtype: float64

Rickie Lambert is the player with the most penalties registered.

What team attributes lead to the most victories?¶

First, it is important to mention that the tables involved in answering this question are Team and Team_Attributes from the European Soccer database.

We've already read the Team table; let's explore it and do the necessary cleaning.

In [31]:
teams
Out[31]:
id team_api_id team_fifa_api_id team_long_name team_short_name
0 1 9987 673.0 KRC Genk GEN
1 2 9993 675.0 Beerschot AC BAC
2 3 10000 15005.0 SV Zulte-Waregem ZUL
3 4 9994 2007.0 Sporting Lokeren LOK
4 5 9984 1750.0 KSV Cercle Brugge CEB
... ... ... ... ... ...
294 49479 10190 898.0 FC St. Gallen GAL
295 49837 10191 1715.0 FC Thun THU
296 50201 9777 324.0 Servette FC SER
297 50204 7730 1862.0 FC Lausanne-Sports LAU
298 51606 7896 NaN Lugano LUG

299 rows × 5 columns

Drop the columns from the teams dataframe that are not necessary for our analysis.

In [32]:
teams.drop(columns=["id", "team_fifa_api_id", "team_short_name"], inplace=True)
teams
Out[32]:
team_api_id team_long_name
0 9987 KRC Genk
1 9993 Beerschot AC
2 10000 SV Zulte-Waregem
3 9994 Sporting Lokeren
4 9984 KSV Cercle Brugge
... ... ...
294 10190 FC St. Gallen
295 10191 FC Thun
296 9777 Servette FC
297 7730 FC Lausanne-Sports
298 7896 Lugano

299 rows × 2 columns

Check the NA values and uniqueness of this dataset.

In [33]:
teams.isna().sum()
Out[33]:
team_api_id       0
team_long_name    0
dtype: int64

This verifies that we are working with a dataset without null values.

Let's check the uniqueness of the values. For this dataset it is important to verify uniqueness, since we are working with IDs and all of them should be distinct in the first place.

In [34]:
teams.apply(lambda x: print(f"'{x.name}' is unique: {x.is_unique}"))
'team_api_id' is unique: True
'team_long_name' is unique: False
Out[34]:
team_api_id       None
team_long_name    None
dtype: object

Given the results, the IDs are all different but there are duplicated team names, which means some teams were stored under different IDs. Let's check that out!

In [35]:
teams.loc[teams["team_long_name"].duplicated(), :]
Out[35]:
team_api_id team_long_name
24 274581 Royal Excel Mouscron
183 8020 Polonia Bytom
199 8024 Widzew Łódź

Now that the duplicated teams are identified, let's analyze their data in the teams_attributes dataframe, since it is related to the teams dataframe through team_api_id.
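As a quick aside, duplicated() with its default keep="first" only flags the second occurrence of each name; duplicated(keep=False) flags every occurrence, which shows the conflicting IDs side by side. A minimal sketch on a hypothetical fixture (the ID 1773 is invented for illustration):

```python
import pandas as pd

# Hypothetical fixture mirroring the duplicated-name situation
teams_demo = pd.DataFrame({
    "team_api_id": [274581, 1773, 8020, 8031],
    "team_long_name": ["Royal Excel Mouscron", "Royal Excel Mouscron",
                       "Polonia Bytom", "Widzew Łódź"],
})

# keep=False marks every row whose name appears more than once,
# so both IDs sharing a name are visible together.
dupes = teams_demo.loc[teams_demo["team_long_name"].duplicated(keep=False)]
print(dupes)
```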

Let's get the data from Team_Attributes table and save it as a dataframe.

In [36]:
teams_attributes = pd.read_sql_query("SELECT * FROM Team_Attributes", con=con)
teams_attributes
Out[36]:
id team_fifa_api_id team_api_id date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribbling buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 1 434 9930 2010-02-22 00:00:00 60 Balanced NaN Little 50 Mixed ... 55 Normal Organised 50 Medium 55 Press 45 Normal Cover
1 2 434 9930 2014-09-19 00:00:00 52 Balanced 48.0 Normal 56 Mixed ... 64 Normal Organised 47 Medium 44 Press 54 Normal Cover
2 3 434 9930 2015-09-10 00:00:00 47 Balanced 41.0 Normal 54 Mixed ... 64 Normal Organised 47 Medium 44 Press 54 Normal Cover
3 4 77 8485 2010-02-22 00:00:00 70 Fast NaN Little 70 Long ... 70 Lots Organised 60 Medium 70 Double 70 Wide Cover
4 5 77 8485 2011-02-22 00:00:00 47 Balanced NaN Little 52 Mixed ... 52 Normal Organised 47 Medium 47 Press 52 Normal Cover
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1453 1454 15005 10000 2011-02-22 00:00:00 52 Balanced NaN Little 52 Mixed ... 53 Normal Organised 46 Medium 48 Press 53 Normal Cover
1454 1455 15005 10000 2012-02-22 00:00:00 54 Balanced NaN Little 51 Mixed ... 50 Normal Organised 44 Medium 55 Press 53 Normal Cover
1455 1456 15005 10000 2013-09-20 00:00:00 54 Balanced NaN Little 51 Mixed ... 32 Little Organised 44 Medium 58 Press 37 Normal Cover
1456 1457 15005 10000 2014-09-19 00:00:00 54 Balanced 42.0 Normal 51 Mixed ... 32 Little Organised 44 Medium 58 Press 37 Normal Cover
1457 1458 15005 10000 2015-09-10 00:00:00 54 Balanced 42.0 Normal 51 Mixed ... 32 Little Organised 44 Medium 58 Press 37 Normal Cover

1458 rows × 25 columns

Next, to make it easier to identify which team each set of attributes belongs to, a column with the team's name will be added to the teams_attributes dataframe.

In [37]:
teams_attributes = pd.merge(
    teams,
    teams_attributes,
    on=["team_api_id"],
)
teams_attributes
Out[37]:
team_api_id team_long_name id team_fifa_api_id date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribbling buildUpPlayDribblingClass buildUpPlayPassing ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 9987 KRC Genk 485 673 2010-02-22 00:00:00 45 Balanced NaN Little 45 ... 60 Normal Organised 70 High 65 Press 70 Wide Cover
1 9987 KRC Genk 486 673 2011-02-22 00:00:00 66 Balanced NaN Little 52 ... 51 Normal Organised 48 Medium 47 Press 54 Normal Offside Trap
2 9987 KRC Genk 487 673 2012-02-22 00:00:00 53 Balanced NaN Little 55 ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
3 9987 KRC Genk 488 673 2013-09-20 00:00:00 58 Balanced NaN Little 38 ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
4 9987 KRC Genk 489 673 2014-09-19 00:00:00 58 Balanced 52.0 Normal 38 ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1453 9777 Servette FC 1156 324 2013-09-20 00:00:00 48 Balanced NaN Little 51 ... 45 Normal Organised 49 Medium 45 Press 49 Normal Cover
1454 7730 FC Lausanne-Sports 672 1862 2010-02-22 00:00:00 30 Slow NaN Little 60 ... 60 Normal Organised 55 Medium 60 Press 50 Normal Cover
1455 7730 FC Lausanne-Sports 673 1862 2012-02-22 00:00:00 37 Balanced NaN Little 49 ... 48 Normal Organised 43 Medium 43 Press 55 Normal Cover
1456 7730 FC Lausanne-Sports 674 1862 2013-09-20 00:00:00 51 Balanced NaN Little 49 ... 48 Normal Organised 43 Medium 43 Press 55 Normal Cover
1457 7730 FC Lausanne-Sports 675 1862 2014-09-19 00:00:00 37 Balanced 55.0 Normal 49 ... 52 Normal Organised 46 Medium 44 Press 55 Normal Cover

1458 rows × 26 columns

The column team_long_name will be renamed to team for easier handling.

In [38]:
teams_attributes.rename(columns={"team_long_name": "team"}, inplace=True)
teams_attributes
Out[38]:
team_api_id team id team_fifa_api_id date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribbling buildUpPlayDribblingClass buildUpPlayPassing ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 9987 KRC Genk 485 673 2010-02-22 00:00:00 45 Balanced NaN Little 45 ... 60 Normal Organised 70 High 65 Press 70 Wide Cover
1 9987 KRC Genk 486 673 2011-02-22 00:00:00 66 Balanced NaN Little 52 ... 51 Normal Organised 48 Medium 47 Press 54 Normal Offside Trap
2 9987 KRC Genk 487 673 2012-02-22 00:00:00 53 Balanced NaN Little 55 ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
3 9987 KRC Genk 488 673 2013-09-20 00:00:00 58 Balanced NaN Little 38 ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
4 9987 KRC Genk 489 673 2014-09-19 00:00:00 58 Balanced 52.0 Normal 38 ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1453 9777 Servette FC 1156 324 2013-09-20 00:00:00 48 Balanced NaN Little 51 ... 45 Normal Organised 49 Medium 45 Press 49 Normal Cover
1454 7730 FC Lausanne-Sports 672 1862 2010-02-22 00:00:00 30 Slow NaN Little 60 ... 60 Normal Organised 55 Medium 60 Press 50 Normal Cover
1455 7730 FC Lausanne-Sports 673 1862 2012-02-22 00:00:00 37 Balanced NaN Little 49 ... 48 Normal Organised 43 Medium 43 Press 55 Normal Cover
1456 7730 FC Lausanne-Sports 674 1862 2013-09-20 00:00:00 51 Balanced NaN Little 49 ... 48 Normal Organised 43 Medium 43 Press 55 Normal Cover
1457 7730 FC Lausanne-Sports 675 1862 2014-09-19 00:00:00 37 Balanced 55.0 Normal 49 ... 52 Normal Organised 46 Medium 44 Press 55 Normal Cover

1458 rows × 26 columns

To make the data easier to analyze, let's first check the cleanliness of the teams_attributes dataset. The columns id and team_fifa_api_id will be removed because they do not contain relevant information.

In [39]:
teams_attributes.drop(columns=["id", "team_fifa_api_id"], inplace=True)
teams_attributes
Out[39]:
team_api_id team date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribbling buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass buildUpPlayPositioningClass ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 9987 KRC Genk 2010-02-22 00:00:00 45 Balanced NaN Little 45 Mixed Organised ... 60 Normal Organised 70 High 65 Press 70 Wide Cover
1 9987 KRC Genk 2011-02-22 00:00:00 66 Balanced NaN Little 52 Mixed Organised ... 51 Normal Organised 48 Medium 47 Press 54 Normal Offside Trap
2 9987 KRC Genk 2012-02-22 00:00:00 53 Balanced NaN Little 55 Mixed Organised ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
3 9987 KRC Genk 2013-09-20 00:00:00 58 Balanced NaN Little 38 Mixed Organised ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
4 9987 KRC Genk 2014-09-19 00:00:00 58 Balanced 52.0 Normal 38 Mixed Organised ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1453 9777 Servette FC 2013-09-20 00:00:00 48 Balanced NaN Little 51 Mixed Organised ... 45 Normal Organised 49 Medium 45 Press 49 Normal Cover
1454 7730 FC Lausanne-Sports 2010-02-22 00:00:00 30 Slow NaN Little 60 Mixed Organised ... 60 Normal Organised 55 Medium 60 Press 50 Normal Cover
1455 7730 FC Lausanne-Sports 2012-02-22 00:00:00 37 Balanced NaN Little 49 Mixed Organised ... 48 Normal Organised 43 Medium 43 Press 55 Normal Cover
1456 7730 FC Lausanne-Sports 2013-09-20 00:00:00 51 Balanced NaN Little 49 Mixed Organised ... 48 Normal Organised 43 Medium 43 Press 55 Normal Cover
1457 7730 FC Lausanne-Sports 2014-09-19 00:00:00 37 Balanced 55.0 Normal 49 Mixed Organised ... 52 Normal Organised 46 Medium 44 Press 55 Normal Cover

1458 rows × 24 columns

Let's check the null values of each column.

In [40]:
teams_attributes.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1458 entries, 0 to 1457
Data columns (total 24 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   team_api_id                     1458 non-null   int64  
 1   team                            1458 non-null   object 
 2   date                            1458 non-null   object 
 3   buildUpPlaySpeed                1458 non-null   int64  
 4   buildUpPlaySpeedClass           1458 non-null   object 
 5   buildUpPlayDribbling            489 non-null    float64
 6   buildUpPlayDribblingClass       1458 non-null   object 
 7   buildUpPlayPassing              1458 non-null   int64  
 8   buildUpPlayPassingClass         1458 non-null   object 
 9   buildUpPlayPositioningClass     1458 non-null   object 
 10  chanceCreationPassing           1458 non-null   int64  
 11  chanceCreationPassingClass      1458 non-null   object 
 12  chanceCreationCrossing          1458 non-null   int64  
 13  chanceCreationCrossingClass     1458 non-null   object 
 14  chanceCreationShooting          1458 non-null   int64  
 15  chanceCreationShootingClass     1458 non-null   object 
 16  chanceCreationPositioningClass  1458 non-null   object 
 17  defencePressure                 1458 non-null   int64  
 18  defencePressureClass            1458 non-null   object 
 19  defenceAggression               1458 non-null   int64  
 20  defenceAggressionClass          1458 non-null   object 
 21  defenceTeamWidth                1458 non-null   int64  
 22  defenceTeamWidthClass           1458 non-null   object 
 23  defenceDefenderLineClass        1458 non-null   object 
dtypes: float64(1), int64(9), object(14)
memory usage: 284.8+ KB

The teams_attributes info shows that the column buildUpPlayDribbling has only 489 non-null values out of 1458, so it does not carry enough information and will be dropped from the analysis.

In [41]:
teams_attributes.drop(columns=["buildUpPlayDribbling"], inplace=True)
teams_attributes.shape
Out[41]:
(1458, 23)
In [42]:
teams_attributes
Out[42]:
team_api_id team date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass buildUpPlayPositioningClass chanceCreationPassing ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 9987 KRC Genk 2010-02-22 00:00:00 45 Balanced Little 45 Mixed Organised 50 ... 60 Normal Organised 70 High 65 Press 70 Wide Cover
1 9987 KRC Genk 2011-02-22 00:00:00 66 Balanced Little 52 Mixed Organised 65 ... 51 Normal Organised 48 Medium 47 Press 54 Normal Offside Trap
2 9987 KRC Genk 2012-02-22 00:00:00 53 Balanced Little 55 Mixed Organised 55 ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
3 9987 KRC Genk 2013-09-20 00:00:00 58 Balanced Little 38 Mixed Organised 67 ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
4 9987 KRC Genk 2014-09-19 00:00:00 58 Balanced Normal 38 Mixed Organised 67 ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1453 9777 Servette FC 2013-09-20 00:00:00 48 Balanced Little 51 Mixed Organised 53 ... 45 Normal Organised 49 Medium 45 Press 49 Normal Cover
1454 7730 FC Lausanne-Sports 2010-02-22 00:00:00 30 Slow Little 60 Mixed Organised 40 ... 60 Normal Organised 55 Medium 60 Press 50 Normal Cover
1455 7730 FC Lausanne-Sports 2012-02-22 00:00:00 37 Balanced Little 49 Mixed Organised 52 ... 48 Normal Organised 43 Medium 43 Press 55 Normal Cover
1456 7730 FC Lausanne-Sports 2013-09-20 00:00:00 51 Balanced Little 49 Mixed Organised 52 ... 48 Normal Organised 43 Medium 43 Press 55 Normal Cover
1457 7730 FC Lausanne-Sports 2014-09-19 00:00:00 37 Balanced Normal 49 Mixed Organised 52 ... 52 Normal Organised 46 Medium 44 Press 55 Normal Cover

1458 rows × 23 columns

Now, let's go back to the analysis of duplicated team names. The first duplicated team found was 'Royal Excel Mouscron'.

In [43]:
teams_attributes[teams_attributes["team"] == "Royal Excel Mouscron"]
Out[43]:
team_api_id team date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass buildUpPlayPositioningClass chanceCreationPassing ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
68 9996 Royal Excel Mouscron 2015-09-10 00:00:00 50 Balanced Normal 50 Mixed Organised 50 ... 50 Normal Organised 45 Medium 45 Press 50 Normal Cover
69 9996 Royal Excel Mouscron 2015-09-10 00:00:00 50 Balanced Normal 50 Mixed Organised 50 ... 50 Normal Organised 45 Medium 45 Press 50 Normal Cover
96 274581 Royal Excel Mouscron 2015-09-10 00:00:00 50 Balanced Normal 50 Mixed Organised 50 ... 50 Normal Organised 45 Medium 45 Press 50 Normal Cover

3 rows × 23 columns

Let's verify whether all three rows have the same values in every column except team_api_id:

In [44]:
teams_attributes.loc[
    teams_attributes["team"] == "Royal Excel Mouscron",
    teams_attributes.columns[1:],
].duplicated()
Out[44]:
68    False
69     True
96     True
dtype: bool

Let's remove the rows that are duplicated in every column except team_api_id.

In [45]:
teams_attributes.drop_duplicates(subset=teams_attributes.columns[1:], inplace=True)
teams_attributes
Out[45]:
team_api_id team date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass buildUpPlayPositioningClass chanceCreationPassing ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 9987 KRC Genk 2010-02-22 00:00:00 45 Balanced Little 45 Mixed Organised 50 ... 60 Normal Organised 70 High 65 Press 70 Wide Cover
1 9987 KRC Genk 2011-02-22 00:00:00 66 Balanced Little 52 Mixed Organised 65 ... 51 Normal Organised 48 Medium 47 Press 54 Normal Offside Trap
2 9987 KRC Genk 2012-02-22 00:00:00 53 Balanced Little 55 Mixed Organised 55 ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
3 9987 KRC Genk 2013-09-20 00:00:00 58 Balanced Little 38 Mixed Organised 67 ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
4 9987 KRC Genk 2014-09-19 00:00:00 58 Balanced Normal 38 Mixed Organised 67 ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1453 9777 Servette FC 2013-09-20 00:00:00 48 Balanced Little 51 Mixed Organised 53 ... 45 Normal Organised 49 Medium 45 Press 49 Normal Cover
1454 7730 FC Lausanne-Sports 2010-02-22 00:00:00 30 Slow Little 60 Mixed Organised 40 ... 60 Normal Organised 55 Medium 60 Press 50 Normal Cover
1455 7730 FC Lausanne-Sports 2012-02-22 00:00:00 37 Balanced Little 49 Mixed Organised 52 ... 48 Normal Organised 43 Medium 43 Press 55 Normal Cover
1456 7730 FC Lausanne-Sports 2013-09-20 00:00:00 51 Balanced Little 49 Mixed Organised 52 ... 48 Normal Organised 43 Medium 43 Press 55 Normal Cover
1457 7730 FC Lausanne-Sports 2014-09-19 00:00:00 37 Balanced Normal 49 Mixed Organised 52 ... 52 Normal Organised 46 Medium 44 Press 55 Normal Cover

1450 rows × 23 columns

Finally, verify that the number of unique team IDs matches the number of unique team names.

In [46]:
teams_attributes["team_api_id"].nunique()
Out[46]:
285
In [47]:
teams_attributes["team"].nunique()
Out[47]:
285
In [48]:
teams_attributes["date"] = pd.to_datetime(teams_attributes["date"], format="%Y-%m-%d")
teams_attributes.dtypes
Out[48]:
team_api_id                                int64
team                                      object
date                              datetime64[ns]
buildUpPlaySpeed                           int64
buildUpPlaySpeedClass                     object
buildUpPlayDribblingClass                 object
buildUpPlayPassing                         int64
buildUpPlayPassingClass                   object
buildUpPlayPositioningClass               object
chanceCreationPassing                      int64
chanceCreationPassingClass                object
chanceCreationCrossing                     int64
chanceCreationCrossingClass               object
chanceCreationShooting                     int64
chanceCreationShootingClass               object
chanceCreationPositioningClass            object
defencePressure                            int64
defencePressureClass                      object
defenceAggression                          int64
defenceAggressionClass                    object
defenceTeamWidth                           int64
defenceTeamWidthClass                     object
defenceDefenderLineClass                  object
dtype: object

What period of time do the team attributes cover?

In [49]:
teams_attributes["date"].min(), teams_attributes["date"].max()
Out[49]:
(Timestamp('2010-02-22 00:00:00'), Timestamp('2015-09-10 00:00:00'))

Let's explore the teams_attributes dataset a bit.

Looking at the dataset, there are multiple attribute observations per team, so let's check the average number of observations per team.

In [50]:
teams_attributes.groupby("team")["date"].count().mean()
Out[50]:
5.087719298245614

Now, what are the minimum and maximum numbers of observations per team?

In [51]:
teams_attributes.groupby("team")["date"].count().min()
Out[51]:
1
In [52]:
teams_attributes.groupby("team")["date"].count().max()
Out[52]:
6

The next plot shows when each observation was recorded.

In [53]:
fig = go.Figure(data=go.Scatter(x=teams_attributes["date"], y=teams_attributes.index, mode="markers"))
fig.update_layout(
    title="Dates where the teams attributes were registered",
    xaxis_title="Date",
    yaxis_title="Observations IDs"
)
fig.show()

As shown, the observations can be matched to the seasons played, and those attributes can then be related to the match results of the corresponding season.

To do that, it's necessary to know when the seasons started and finished per league/country.

In [54]:
seasons = matches.groupby(["season", "country"])["date"].agg(["min", "max"]).reset_index()
seasons
Out[54]:
season country min max
0 2008/2009 Belgium 2008-08-16 2009-05-16
1 2008/2009 England 2008-08-16 2009-05-24
2 2008/2009 France 2008-08-09 2009-05-30
3 2008/2009 Germany 2008-08-15 2009-05-23
4 2008/2009 Italy 2008-08-30 2009-05-31
... ... ... ... ...
83 2015/2016 Poland 2015-07-17 2016-04-09
84 2015/2016 Portugal 2015-08-14 2016-05-15
85 2015/2016 Scotland 2015-08-01 2016-05-15
86 2015/2016 Spain 2015-08-21 2016-05-15
87 2015/2016 Switzerland 2015-07-18 2016-05-25

88 rows × 4 columns

Then each observation must be related to the appropriate season; that is what the next function does.

In [55]:
def get_season(country, date):
    mask = (seasons["min"] <= date) & (date <= seasons["max"]) & (seasons["country"] == country)
    season = seasons.loc[mask, "season"]
    return season.iloc[0] if len(season) else np.nan
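As a quick sanity check of this interval-lookup logic, here is a minimal self-contained sketch; the two Belgian seasons and their date ranges below are made up for illustration and are not taken from the database.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the `seasons` dataframe built above.
seasons = pd.DataFrame({
    "season": ["2009/2010", "2010/2011"],
    "country": ["Belgium", "Belgium"],
    "min": pd.to_datetime(["2009-08-01", "2010-08-01"]),
    "max": pd.to_datetime(["2010-05-30", "2011-05-30"]),
})

def get_season(country, date):
    # A date belongs to a season when it falls between that season's
    # first and last match dates for the given country.
    mask = (seasons["min"] <= date) & (date <= seasons["max"]) & (seasons["country"] == country)
    season = seasons.loc[mask, "season"]
    return season.iloc[0] if len(season) else np.nan

print(get_season("Belgium", pd.Timestamp("2010-02-22")))  # 2009/2010
print(get_season("Belgium", pd.Timestamp("2010-07-15")))  # nan (off-season gap)
```

A date that falls in the gap between two seasons maps to no season at all, which is where NaN values can appear.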
In [56]:
teams_attributes
Out[56]:
team_api_id team date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass buildUpPlayPositioningClass chanceCreationPassing ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 9987 KRC Genk 2010-02-22 45 Balanced Little 45 Mixed Organised 50 ... 60 Normal Organised 70 High 65 Press 70 Wide Cover
1 9987 KRC Genk 2011-02-22 66 Balanced Little 52 Mixed Organised 65 ... 51 Normal Organised 48 Medium 47 Press 54 Normal Offside Trap
2 9987 KRC Genk 2012-02-22 53 Balanced Little 55 Mixed Organised 55 ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
3 9987 KRC Genk 2013-09-20 58 Balanced Little 38 Mixed Organised 67 ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
4 9987 KRC Genk 2014-09-19 58 Balanced Normal 38 Mixed Organised 67 ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1453 9777 Servette FC 2013-09-20 48 Balanced Little 51 Mixed Organised 53 ... 45 Normal Organised 49 Medium 45 Press 49 Normal Cover
1454 7730 FC Lausanne-Sports 2010-02-22 30 Slow Little 60 Mixed Organised 40 ... 60 Normal Organised 55 Medium 60 Press 50 Normal Cover
1455 7730 FC Lausanne-Sports 2012-02-22 37 Balanced Little 49 Mixed Organised 52 ... 48 Normal Organised 43 Medium 43 Press 55 Normal Cover
1456 7730 FC Lausanne-Sports 2013-09-20 51 Balanced Little 49 Mixed Organised 52 ... 48 Normal Organised 43 Medium 43 Press 55 Normal Cover
1457 7730 FC Lausanne-Sports 2014-09-19 37 Balanced Normal 49 Mixed Organised 52 ... 52 Normal Organised 46 Medium 44 Press 55 Normal Cover

1450 rows × 23 columns

The seasons dataframe (with the start and end dates) carries two key pieces of information: the country and the season. The teams_attributes dataframe, on the other hand, only has the observation date and the associated team, so the country must be added in order to join the information from both datasets.

In [57]:
countries_teams = matches[["country", "home_team"]].drop_duplicates().rename(columns={"home_team": "team"})
teams_attributes = pd.merge(
    countries_teams,
    teams_attributes,
    on=["team"],
)
teams_attributes
Out[57]:
country team team_api_id date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass buildUpPlayPositioningClass ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 Belgium KRC Genk 9987 2010-02-22 45 Balanced Little 45 Mixed Organised ... 60 Normal Organised 70 High 65 Press 70 Wide Cover
1 Belgium KRC Genk 9987 2011-02-22 66 Balanced Little 52 Mixed Organised ... 51 Normal Organised 48 Medium 47 Press 54 Normal Offside Trap
2 Belgium KRC Genk 9987 2012-02-22 53 Balanced Little 55 Mixed Organised ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
3 Belgium KRC Genk 9987 2013-09-20 58 Balanced Little 38 Mixed Organised ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
4 Belgium KRC Genk 9987 2014-09-19 58 Balanced Normal 38 Mixed Organised ... 56 Normal Organised 47 Medium 45 Press 55 Normal Cover
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1445 Switzerland Servette FC 9777 2013-09-20 48 Balanced Little 51 Mixed Organised ... 45 Normal Organised 49 Medium 45 Press 49 Normal Cover
1446 Switzerland FC Lausanne-Sports 7730 2010-02-22 30 Slow Little 60 Mixed Organised ... 60 Normal Organised 55 Medium 60 Press 50 Normal Cover
1447 Switzerland FC Lausanne-Sports 7730 2012-02-22 37 Balanced Little 49 Mixed Organised ... 48 Normal Organised 43 Medium 43 Press 55 Normal Cover
1448 Switzerland FC Lausanne-Sports 7730 2013-09-20 51 Balanced Little 49 Mixed Organised ... 48 Normal Organised 43 Medium 43 Press 55 Normal Cover
1449 Switzerland FC Lausanne-Sports 7730 2014-09-19 37 Balanced Normal 49 Mixed Organised ... 52 Normal Organised 46 Medium 44 Press 55 Normal Cover

1450 rows × 24 columns

Once the country column exists, it is possible to associate a season with each attributes observation using the function defined a few cells above.

In [58]:
teams_attributes["season"] = teams_attributes.apply(lambda x: get_season(x["country"], x["date"]), axis=1)
teams_attributes
Out[58]:
country team team_api_id date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribblingClass buildUpPlayPassing buildUpPlayPassingClass buildUpPlayPositioningClass ... chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass season
0 Belgium KRC Genk 9987 2010-02-22 45 Balanced Little 45 Mixed Organised ... Normal Organised 70 High 65 Press 70 Wide Cover 2009/2010
1 Belgium KRC Genk 9987 2011-02-22 66 Balanced Little 52 Mixed Organised ... Normal Organised 48 Medium 47 Press 54 Normal Offside Trap 2010/2011
2 Belgium KRC Genk 9987 2012-02-22 53 Balanced Little 55 Mixed Organised ... Normal Organised 47 Medium 45 Press 55 Normal Cover 2011/2012
3 Belgium KRC Genk 9987 2013-09-20 58 Balanced Little 38 Mixed Organised ... Normal Organised 47 Medium 45 Press 55 Normal Cover NaN
4 Belgium KRC Genk 9987 2014-09-19 58 Balanced Normal 38 Mixed Organised ... Normal Organised 47 Medium 45 Press 55 Normal Cover 2014/2015
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1445 Switzerland Servette FC 9777 2013-09-20 48 Balanced Little 51 Mixed Organised ... Normal Organised 49 Medium 45 Press 49 Normal Cover 2013/2014
1446 Switzerland FC Lausanne-Sports 7730 2010-02-22 30 Slow Little 60 Mixed Organised ... Normal Organised 55 Medium 60 Press 50 Normal Cover 2009/2010
1447 Switzerland FC Lausanne-Sports 7730 2012-02-22 37 Balanced Little 49 Mixed Organised ... Normal Organised 43 Medium 43 Press 55 Normal Cover 2011/2012
1448 Switzerland FC Lausanne-Sports 7730 2013-09-20 51 Balanced Little 49 Mixed Organised ... Normal Organised 43 Medium 43 Press 55 Normal Cover 2013/2014
1449 Switzerland FC Lausanne-Sports 7730 2014-09-19 37 Balanced Normal 49 Mixed Organised ... Normal Organised 46 Medium 44 Press 55 Normal Cover 2014/2015

1450 rows × 25 columns

So far we have worked with two datasets: teams and teams_attributes. To answer the question (finding the attributes most relevant to victories), those attributes must be related to match results. We already know which attributes belong to which seasons; now let's relate them to the match results of those seasons.

The results of the matches are in matches dataframe.

In [59]:
statistics = pd.concat(
    [
        matches[["country", "season", "home_team", "home_team_points"]].rename(columns={"home_team": "team", "home_team_points": "points"}),
        matches[["country", "season", "away_team", "away_team_points"]].rename(columns={"away_team": "team", "away_team_points": "points"}),
    ],
    ignore_index=True,
)
statistics
Out[59]:
country season team points
0 Belgium 2008/2009 KRC Genk 1
1 Belgium 2008/2009 SV Zulte-Waregem 1
2 Belgium 2008/2009 KSV Cercle Brugge 0
3 Belgium 2008/2009 KAA Gent 3
4 Belgium 2008/2009 FCV Dender EH 0
... ... ... ... ...
51953 Switzerland 2015/2016 FC Thun 0
51954 Switzerland 2015/2016 FC Luzern 3
51955 Switzerland 2015/2016 FC Sion 0
51956 Switzerland 2015/2016 FC Zürich 1
51957 Switzerland 2015/2016 FC Basel 0

51958 rows × 4 columns

Remove the teams from the statistics dataframe for which there are no attributes.

In [60]:
teams_with_attributes = teams_attributes["team"].unique()
statistics = statistics[statistics["team"].isin(teams_with_attributes)].copy()
statistics
Out[60]:
country season team points
0 Belgium 2008/2009 KRC Genk 1
1 Belgium 2008/2009 SV Zulte-Waregem 1
2 Belgium 2008/2009 KSV Cercle Brugge 0
3 Belgium 2008/2009 KAA Gent 3
5 Belgium 2008/2009 KV Mechelen 1
... ... ... ... ...
51953 Switzerland 2015/2016 FC Thun 0
51954 Switzerland 2015/2016 FC Luzern 3
51955 Switzerland 2015/2016 FC Sion 0
51956 Switzerland 2015/2016 FC Zürich 1
51957 Switzerland 2015/2016 FC Basel 0

51602 rows × 4 columns

Verify there are the same teams in both dataframes.

In [61]:
set(statistics["team"].unique()) == set(teams_attributes["team"].unique())
Out[61]:
True
In [62]:
statistics["points"].value_counts()
Out[62]:
3    19304
0    19193
1    13105
Name: points, dtype: int64
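As an aside, the same numeric-to-label conversion can also be spelled with Series.map, a common alternative to DataFrame.replace when only a single column is involved; a minimal sketch on a toy frame:

```python
import pandas as pd

# Toy stand-in for the statistics dataframe.
toy = pd.DataFrame({"points": [3, 1, 0, 3]})

# Map each point value to a match outcome label.
toy["game_status"] = toy["points"].map({3: "won", 1: "drawn", 0: "lost"})
print(toy["game_status"].tolist())  # ['won', 'drawn', 'lost', 'won']
```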
In [63]:
statistics.replace(to_replace={"points": {3: "won", 1: "drawn", 0: "lost"}}, inplace=True)
statistics
Out[63]:
country season team points
0 Belgium 2008/2009 KRC Genk drawn
1 Belgium 2008/2009 SV Zulte-Waregem drawn
2 Belgium 2008/2009 KSV Cercle Brugge lost
3 Belgium 2008/2009 KAA Gent won
5 Belgium 2008/2009 KV Mechelen drawn
... ... ... ... ...
51953 Switzerland 2015/2016 FC Thun lost
51954 Switzerland 2015/2016 FC Luzern won
51955 Switzerland 2015/2016 FC Sion lost
51956 Switzerland 2015/2016 FC Zürich drawn
51957 Switzerland 2015/2016 FC Basel lost

51602 rows × 4 columns

In [64]:
statistics.rename(columns={"points": "game_status"}, inplace=True)
statistics
Out[64]:
country season team game_status
0 Belgium 2008/2009 KRC Genk drawn
1 Belgium 2008/2009 SV Zulte-Waregem drawn
2 Belgium 2008/2009 KSV Cercle Brugge lost
3 Belgium 2008/2009 KAA Gent won
5 Belgium 2008/2009 KV Mechelen drawn
... ... ... ... ...
51953 Switzerland 2015/2016 FC Thun lost
51954 Switzerland 2015/2016 FC Luzern won
51955 Switzerland 2015/2016 FC Sion lost
51956 Switzerland 2015/2016 FC Zürich drawn
51957 Switzerland 2015/2016 FC Basel lost

51602 rows × 4 columns

Our main goal is to identify which attributes matter most. The first strategy is to fit a model that predicts the result of a match and then analyze the coefficient of each attribute (the attributes become the model's features) to see which ones carry the most weight.

But before that, let's analyze the distribution of the categorical attributes across the three game statuses: won, drawn, and lost. After that, the numerical attributes will be analyzed as well.

In [65]:
categorical_features = [
    "buildUpPlaySpeedClass",
    "buildUpPlayDribblingClass",
    "buildUpPlayPassingClass",
    "buildUpPlayPositioningClass",
    "chanceCreationPassingClass",
    "chanceCreationCrossingClass",
    "chanceCreationShootingClass",
    "chanceCreationPositioningClass",
    "defencePressureClass",
    "defenceAggressionClass",
    "defenceTeamWidthClass",
    "defenceDefenderLineClass",
]
In [66]:
numerical_features = [
    "buildUpPlaySpeed",
    "buildUpPlayPassing",
    "chanceCreationPassing",
    "chanceCreationCrossing",
    "chanceCreationShooting",
    "defencePressure",
    "defenceAggression",
    "defenceTeamWidth",
]

Join the attributes and the results.

In [67]:
data = pd.merge(
    statistics,
    teams_attributes,
    on=["season", "country", "team"],
)
data
Out[67]:
country season team game_status team_api_id date buildUpPlaySpeed buildUpPlaySpeedClass buildUpPlayDribblingClass buildUpPlayPassing ... chanceCreationShooting chanceCreationShootingClass chanceCreationPositioningClass defencePressure defencePressureClass defenceAggression defenceAggressionClass defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass
0 Belgium 2009/2010 Standard de Liège drawn 9985 2010-02-22 53 Balanced Little 40 ... 65 Normal Organised 70 High 70 Double 65 Normal Cover
1 Belgium 2009/2010 Standard de Liège won 9985 2010-02-22 53 Balanced Little 40 ... 65 Normal Organised 70 High 70 Double 65 Normal Cover
2 Belgium 2009/2010 Standard de Liège drawn 9985 2010-02-22 53 Balanced Little 40 ... 65 Normal Organised 70 High 70 Double 65 Normal Cover
3 Belgium 2009/2010 Standard de Liège won 9985 2010-02-22 53 Balanced Little 40 ... 65 Normal Organised 70 High 70 Double 65 Normal Cover
4 Belgium 2009/2010 Standard de Liège drawn 9985 2010-02-22 53 Balanced Little 40 ... 65 Normal Organised 70 High 70 Double 65 Normal Cover
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
37515 Switzerland 2015/2016 BSC Young Boys lost 10192 2015-09-10 52 Balanced Normal 64 ... 46 Normal Organised 44 Medium 34 Press 50 Normal Cover
37516 Switzerland 2015/2016 BSC Young Boys won 10192 2015-09-10 52 Balanced Normal 64 ... 46 Normal Organised 44 Medium 34 Press 50 Normal Cover
37517 Switzerland 2015/2016 BSC Young Boys won 10192 2015-09-10 52 Balanced Normal 64 ... 46 Normal Organised 44 Medium 34 Press 50 Normal Cover
37518 Switzerland 2015/2016 BSC Young Boys won 10192 2015-09-10 52 Balanced Normal 64 ... 46 Normal Organised 44 Medium 34 Press 50 Normal Cover
37519 Switzerland 2015/2016 BSC Young Boys lost 10192 2015-09-10 52 Balanced Normal 64 ... 46 Normal Organised 44 Medium 34 Press 50 Normal Cover

37520 rows × 26 columns

The following plots cover the categorical variables and show the proportion of each attribute category grouped by match result.

In [68]:
for feature in categorical_features:
    groups = data.groupby(feature, sort=False)

    figure_data = list()
    for group_name, values in groups:
        counts = values["game_status"].value_counts(normalize=True) * 100.0
        figure_data += [go.Bar(name=group_name, x=counts.index, y=counts, texttemplate="%{y:.1f}", textposition="outside")]

    fig = go.Figure(data=figure_data)
    fig.update_layout(
        title=feature,
        yaxis_title="percentage",
        barmode="group",
    )
    fig.show()

There is no clear, significant difference between the attribute categories across the match results.

Let's see what happens with the numerical variables, using histograms of their value distributions.

In [69]:
for feature in numerical_features:
    fig = go.Figure()
    groups = data.groupby("game_status")
    for group_name, values in groups:
        fig.add_trace(go.Histogram(x=values[feature], name=group_name))

    fig.update_layout(
        barmode="overlay",
        title_text=feature,
        xaxis_title_text=feature,
        yaxis_title_text="counts",
    )
    fig.update_traces(opacity=0.75)

    fig.show()

The distributions of the numerical variables are quite similar across the three match outcomes, so it is difficult to spot a significant difference.

Recalling the proposal of building a model to predict match outcomes, we can categorize the outcome as won, drawn, or lost (since we are interested in victories). A useful model here is logistic regression: it is designed for classification problems, and classifying whether a team won, drew, or lost fits that setting. Moreover, once the model is trained, its regression coefficients can be analyzed to gauge the relevance of each attribute to the prediction.

Logistic Regression Model¶

The numerical features are used to train the model, and the target variable is game_status.

In [70]:
X = data.loc[:, numerical_features]
scaler = StandardScaler()
X = scaler.fit_transform(X)
y = data["game_status"]
In [71]:
model = LogisticRegression(multi_class="ovr")
model.fit(X, y)
Out[71]:
LogisticRegression(multi_class='ovr')

Once the model is fitted with the numerical features, we can check how well it predicts the status of the matches.

In [72]:
predictions = model.predict(X)
Counter(predictions)
Out[72]:
Counter({'won': 19792, 'lost': 17728})

The model only predicts two of the three classes, which is a bad sign: it never predicts 'drawn'.

Let's check the score, which is the mean accuracy on the training data.

In [73]:
model.score(X, y)
Out[73]:
0.41409914712153517

This score beats a model that assigns classes at random (roughly 0.33 accuracy for three classes), but the model still only predicts two classes, which is not good.
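One way to make the missing-class problem explicit is to count, for each true class, which classes the model predicted (a bare-bones confusion table). The labels below are made up to mimic a classifier that never outputs 'drawn'; sklearn's confusion_matrix would give the same information.

```python
from collections import Counter

def per_class_prediction_counts(y_true, y_pred):
    # For each true class, count how often each class was predicted.
    counts = {}
    for true_label, pred_label in zip(y_true, y_pred):
        counts.setdefault(true_label, Counter())[pred_label] += 1
    return counts

# Made-up labels mimicking the fitted model: 'drawn' is never predicted.
y_true = ["won", "drawn", "lost", "won", "drawn", "lost"]
y_pred = ["won", "won",   "lost", "won", "lost",  "lost"]

table = per_class_prediction_counts(y_true, y_pred)
print(table["drawn"])  # every 'drawn' example lands in another class
```

With zero correct predictions for 'drawn', its recall is 0, a failure that the overall accuracy number hides.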

Let's check the regression coefficients.

The classes have the following order:

In [74]:
model.classes_
Out[74]:
array(['drawn', 'lost', 'won'], dtype=object)

The coefficients for class drawn are:

In [75]:
model.coef_[0]
Out[75]:
array([ 0.01834855,  0.01630169, -0.01604067, -0.01510765,  0.00953911,
       -0.04469597,  0.00571335,  0.01451332])

The coefficients for class lost are:

In [76]:
model.coef_[1]
Out[76]:
array([-0.0294302 ,  0.12081135, -0.01215875, -0.06227151, -0.02938927,
       -0.09400256, -0.01550754,  0.01593723])

The coefficients for class won are:

In [77]:
model.coef_[2]
Out[77]:
array([ 0.01394338, -0.13328622,  0.02334461,  0.07441526,  0.02213145,
        0.12898247,  0.01048978, -0.02713506])

The following are the coefficients mapped through the sigmoid function, e^x / (1 + e^x), and we can see that almost all of them sit near 0.5 for all three classes.

In [78]:
[np.exp(x)/(1+np.exp(x)) for x in model.coef_[0]]
Out[78]:
[0.5045870089957382,
 0.504075333013677,
 0.4959899189333752,
 0.4962231599566076,
 0.5023847604231396,
 0.48882786827775615,
 0.5014283343399757,
 0.5036282662412285]
In [79]:
[np.exp(x)/(1+np.exp(x)) for x in model.coef_[1]]
Out[79]:
[0.49264298046337285,
 0.5301661561357174,
 0.49696034984685244,
 0.4844371515151573,
 0.4926532116311339,
 0.4765166498049305,
 0.4961231917436247,
 0.5039842224328268]
In [80]:
[np.exp(x)/(1+np.exp(x)) for x in model.coef_[2]]
Out[80]:
[0.5034857875762043,
 0.4667276879921658,
 0.5058358867192281,
 0.5185952346280284,
 0.5055326355390368,
 0.5322009863934838,
 0.5026224203358294,
 0.49321665033619405]
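Collecting the per-class coefficients into a single DataFrame makes the side-by-side comparison easier than three separate lists. This is a sketch on toy data: the feature names are invented, the labels are random, and the default (multinomial) solver is used rather than the notebook's `multi_class="ovr"`, but the table shape is the same.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the notebook's X / y; attr_0..attr_3 are
# hypothetical feature names replacing `numerical_features`.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(300, 4))
y_toy = rng.choice(["drawn", "lost", "won"], size=300)
features = [f"attr_{i}" for i in range(4)]

model = LogisticRegression().fit(X_toy, y_toy)

# One row per class, one column per feature: classes_ gives the row
# order, so each coefficient is labeled instead of positional.
coef_table = pd.DataFrame(model.coef_, index=model.classes_, columns=features)
print(coef_table.round(3))
```

With the real model, `pd.DataFrame(model.coef_, index=model.classes_, columns=numerical_features)` would show at a glance that no attribute dominates for any class.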

Conclusions¶

From the European Soccer database I can conclude that even though some teams stay at the top throughout all the seasons, that does not mean they are the teams that improved the most over time.

On the contrary, the teams that improved the most are KAA Gent from the Belgian Jupiler League, Bayer 04 Leverkusen from the German Bundesliga, and Sunderland from the English Premier League. Interestingly, the teams whose performance declined the most over time are FC Bayern Munich from Germany, Ajax from the Netherlands, and Atlético de Madrid from Spain, even though they are top clubs worldwide.

Regarding the team attributes that could lead to the most victories, unfortunately it was not possible to discover anything relevant. The attributes are evenly distributed across all the clubs and are not as decisive as expected.

One limitation that should be considered for this particular analysis is that, in the matches table, there is no data on the players who participated in each game. For that reason, it was difficult to determine which players and which attributes contributed to the most victories. Another relevant limitation is the absence of metadata, which narrowed the scope of the analysis: it was hard to interpret the data itself, and therefore hard to understand the contribution of, and interaction between, the variables.

The characteristics of the team attributes, together with the two limitations explained above, prevented a deeper and more meaningful analysis aimed, for example, at decision-making for improving team performance.

References¶

  • For plots using plotly
  • How to Choose a Feature Selection Method For Machine Learning
  • How to Calculate Feature Importance With Python
  • One-vs-Rest and One-vs-One for Multi-Class Classification
  • ScikitLearn Logistic Regression
  • How to interpret multiclass logistic regression coefficients?
  • Logistical Regression II - Multinomial Data